Overview

This document contains instructions on Project 1 for STA 207 in Winter 2021. This document is made with R markdown. The rmd file to generate this document is available on the course website.

Background

In this project, we study the dataset from a very influential randomized experiment. The Tennesses Student/Teacher Achievement Ratio study (a.k.a. Project STAR) was conducted in the late 1980s to evaluate the effect of class sizes on test scores. This dataset has been used as a classic examples in many textbooks and research papers. You are encouraged to read more about the experiment design and how others analyze this dataset. This document only provides a brief explanation of the dataset for this course project.

The study randomly assigned students to small classes, regular classes, and regular classes with a teacher’s aide. In order to randomize properly, schools were enrolled only if they had enough student body to have at least one class of each type. Once the schools were enrolled, students were randomly assigned to the three types of classes, and one teacher was randomly assigned to each class.

The dataset contains scaled scores for math and reading from kindergarten to 3rd grade. We will only examine the math scores in 1st grade in this project. The primary question of interest is whether there is any differences in math scaled scores in 1st grade across class types, and if so, a secondary question of interest is which class type is associated with the highest math scaled scores in 1st grade. In particular, we will treat each teacher as the basic unit of our analysis. To put it in another way, we will treat each class (uniquely identified by its assigned teacher) as an observation. Noting that there are multiple students in each class, some data manipulation are warranted.

Suggested outline

The following list provides one potential structure of the data analysis report. A detailed template for this project is provided in a separate RMD file. The detailed template is provided as a learning tool for the first data analysis project in this course. There will not be any templates for future projects.

library(ggplot2)
library(MASS)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.4     ✓ purrr   0.3.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::select() masks MASS::select()

1. Introduce the data set and the questions of interest.

“The dataset contains scaled scores for math and reading from kindergarten to 3rd grade. We will only examine the math scores in 1st grade in this project. The primary question of interest is whether there is any differences in math scaled scores in 1st grade across class types, and if so, a secondary question of interest is which class type is associated with the highest math scaled scores in 1st grade. In particular, we will treat each teacher as the basic unit of our analysis. To put it in another way, we will treat each class (uniquely identified by its assigned teacher) as an observation. Noting that there are multiple students in each class, some data manipulation are warranted.”

2. Review the background of Project STAR, and find the data set from the Internet.

students = read.csv('STAR_Students.tab', sep='\t')
#head(students)
#colnames(students)
length(rownames(students)) # total number of observations
## [1] 11601

3. Explore this dataset and generate summary statistics that you find informative, and explain your findings. In particular,

First lets consider some data cleaning

students_bu = students
students$g1classtype = as.integer(students$g1classtype)
students = students %>% filter(!is.na(g1tmathss))
students = students %>% filter(!is.na(g1classtype))
students = students %>% filter(!is.na(g1schid))

#students$g1mathss = students[!is.na(students$g1mathss)]
#pairwise.t.test(x=students$g1mathss, g=students$g1classtype, p.adjust.method = 'bonf')
#head(students)
length(rownames(students))
## [1] 6598

a. obtain math scaled scores in the 1st grade with teachers as the unit,

teach_agg = aggregate(list(students$g1tmathss, students$g1schid, students$g1classtype, students$g1classsize),
          by = list(students$g1tchid),              
          FUN = mean) 

colnames(teach_agg) = c('g1tchid', 'g1tmathss', 'g1schid', 'g1classtype', 'g1classsize' )
#teach_agg$class_type = teach_class_agg$x
head(teach_agg)
##    g1tchid g1tmathss g1schid g1classtype g1classsize
## 1 11203804  486.8750  112038           3          28
## 2 11203805  499.8235  112038           1          17
## 3 11203806  496.0370  112038           2          28
## 4 12305604  523.7391  123056           2          25
## 5 12305605  532.8000  123056           3          27
## 6 12305606  534.4706  123056           1          19

b. and investigate the relationship between school indicator, class types, and math scaled scores in 1st grade.

#boxplot(students$g1mathbsobjpct, as.factor(students$g1classtype))
hist(teach_agg$g1tmathss)

summary(as.factor(teach_agg$g1classtype))
##   1   2   3 
## 124 115 100
teach_agg$g1schid = as.factor(teach_agg$g1schid)
teach_agg$g1classtype = as.factor(teach_agg$g1classtype)
ggplot(teach_agg, aes(x=g1classtype, y=g1tmathss, fill=g1tmathss)) +
    geom_boxplot(alpha=0.7) +
    stat_summary(fun=mean, geom="point", shape=20, size=5, color="blue", fill="black") + 
    labs(title='Class Type vs Math Scaled Score', x='Class Type', y='Math Scaled Score')

ggplot(teach_agg, aes(x=g1schid, y=g1tmathss, fill=g1tmathss)) +
    geom_boxplot(alpha=0.7) +
    stat_summary(fun=mean, geom="point", shape=20, size=2, color="blue", fill="black") +
    labs(title='School ID vs Math Scaled Score', x='School ID', y='Math Scaled Score') +
    theme(axis.text.x = element_text(angle = 90))

# Main effects plot
plotmeans(g1tmathss~g1classtype,data=teach_agg,xlab="Class Type",ylab="Math Scaled Score", main="Main effect, Class Type") 

# Main effects plot
?plotmeans
plotmeans(g1tmathss~g1schid,data=teach_agg,n.label=FALSE,xlab="School ID",ylab="Math Scaled Score", main="Main effect, School ID") 
## Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped

## Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped

## Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped

## Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped

## Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped
## Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped

## Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped

## Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped

## Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped

## Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped

4. Propose an appropriate model to answer the questions of interest. The model should include the school indicator as a factor/regressor to adjust for.

The null hypothesis for the primary question of interest is \(H_0 : \alpha_1 = \alpha_2 = \alpha_3 = 0\), and the alternative is \(H_a\) : not all \(\alpha\)s are zero. You can find the test statistic and p-value using summary(anova.fit), if you save your fitted model as anova.fit. Please be sure specify the significance level and interpret your test result. Explain any additional assumptions involved in this test.

length(unique(teach_agg$g1schid))
## [1] 76

a. Explain your notation.

  • We can define a two-way ANOVA model as follows

  • \(Y_{ijk} = \mu_{..} + \alpha_{i} + \beta_{j} + \epsilon_{ijk}\),

  • where the index \(i\) represents the class type: small (\(i=1\)), regular (\(i=2\)), regular with aide (\(i=3\)),

  • and the index \(j\) represents the school indicator \(j=1....76\) based on the result shown above. .

  • \(\alpha_i\) is the main effect of the factor class type.

  • \(\beta_j\) is the main effect of the factor school ID.

  • You need to explain the rest of the parameters, state constraints on the parameters, and justify the choice of model (e.g., why no interaction terms).

b. State assumptions for your model.

hist(students$g1tmathss)

c. Explain why your model is appropriate for this task.

  • The interaction term is not considered, in other words the model is an additive model. This is fitting given the situation where the interaction term between school ID and class type ID should not be considered since it seems to not be meaningful since a teacher only has one class type which associates with one school.
  • In addtion, the computational demands of this command were too high and would require an HPC.

5. Fit the proposed model in (4) and explain your results.

fit_tch_sch = lm(g1tmathss ~ g1schid + g1classtype, data = teach_agg)
anova(fit_tch_sch)
## Analysis of Variance Table
## 
## Response: g1tmathss
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## g1schid      75 136424  1819.0  6.5733 < 2.2e-16 ***
## g1classtype   2  12026  6013.1 21.7298 1.866e-09 ***
## Residuals   261  72225   276.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Is it worth considering the interaction term?

fit = aov(g1tmathss ~ g1classtype + g1schid + g1classtype:g1schid, data = teach_agg)
summary(fit)
##                      Df Sum Sq Mean Sq F value   Pr(>F)    
## g1classtype           2  11617    5809  19.936 3.69e-08 ***
## g1schid              75 136833    1824   6.262  < 2e-16 ***
## g1classtype:g1schid 146  38718     265   0.910    0.706    
## Residuals           115  33507     291                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit = aov(g1tmathss ~ g1classtype + g1schid, data = teach_agg)
summary(fit)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## g1classtype   2  11617    5809  20.991 3.52e-09 ***
## g1schid      75 136833    1824   6.593  < 2e-16 ***
## Residuals   261  72225     277                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6. Conduct model diagnostics and/or sensitivity analysis.

plot(fit)

boxcox(fit)

#?boxcox
fit2 = aov(g1tmathss^-2 ~ g1classtype + g1schid, data = teach_agg)
summary(fit2)
##              Df    Sum Sq   Mean Sq F value   Pr(>F)    
## g1classtype   2 1.967e-12 9.833e-13  20.588 4.99e-09 ***
## g1schid      75 2.512e-11 3.350e-13   7.014  < 2e-16 ***
## Residuals   261 1.247e-11 4.780e-14                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pairwise t test to see what groups are best from our anova and lets use a few different p adjustment methods for multiple testing.

test=pairwise.t.test(teach_agg$g1tmathss^-2, teach_agg$g1classtype)
test$p.value
##              1         2
## 2 0.0001606678        NA
## 3 0.0143422668 0.2251094
pairwise.t.test(teach_agg$g1tmathss^-2, teach_agg$g1classtype, p.adjust.method = 'bonf')
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  teach_agg$g1tmathss^-2 and teach_agg$g1classtype 
## 
##   1       2      
## 2 0.00016 -      
## 3 0.02151 0.67533
## 
## P value adjustment method: bonferroni
pairwise.t.test(teach_agg$g1tmathss^-2, teach_agg$g1classtype, p.adjust.method = 'BH')
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  teach_agg$g1tmathss^-2 and teach_agg$g1classtype 
## 
##   1       2      
## 2 0.00016 -      
## 3 0.01076 0.22511
## 
## P value adjustment method: BH
alpha=0.05;
T.ci=TukeyHSD(fit2,conf.level = 1-alpha)

# The highest wage is in management, and the second largest is in technical
get("g1classtype", T.ci)
##              diff           lwr          upr        p adj
## 2-1  1.771954e-07  1.105047e-07 2.438860e-07 4.640207e-09
## 3-1  1.216169e-07  5.237978e-08 1.908539e-07 1.382845e-04
## 3-2 -5.557851e-08 -1.260147e-07 1.485769e-08 1.525267e-01
par(mfrow=c(2,2))
plot(fit2)

boxcox(fit2)

  • Looks a lot better now based on the boxcox log likelihood plot. In addtion, there appear to be some improvements in the Normal QQ plot but there is no stark difference.

Lets do this for all of the math scaled scores for the following years with grade 1 class type and school to see if the long term impact reported by other studies is true. If it isnt lets look at the corresponding years data instead as in the same school, class type, and math score for the same year.

my_star <- function(mathvar, classtypevar, schoolvar, teachervar){

  students = read.csv('STAR_Students.tab', sep='\t')
  #students$g1classtype = as.integer(students$c)
  students = students %>% filter(!is.na(!!as.symbol(mathvar)))
  students = students %>% filter(!is.na(!!as.symbol(teachervar)))
  students = students %>% filter(!is.na(!!as.symbol(schoolvar)))
  students = students %>% filter(!is.na(!!as.symbol(classtypevar)))
  teach_agg <- aggregate(list(students[,mathvar], students[,schoolvar], students[,classtypevar]),
        by = list(students[,teachervar]),              
        FUN = mean) 
  colnames(teach_agg) = c(teachervar, mathvar, schoolvar, classtypevar)
  teach_agg$classtypevar = as.factor(teach_agg[,classtypevar])
  teach_agg$schoolvar = as.factor(teach_agg[,schoolvar])
  teach_agg$mathvar = teach_agg[,mathvar]

  print(head(teach_agg))
  fit_my_star = aov(mathvar ~ classtypevar + schoolvar, data = teach_agg)
  return(teach_agg)
  # transformation should be considered for every scenario seperately
}

my_star_stat<- function(agg, power,fit){
  print(summary(fit))
  par(mfrow=c(2,2))
  plot(fit)
  # Pairwise t test with different correction methods
  print(pairwise.t.test(teach_agg$mathvar^power, teach_agg$classtypevar)$p.value)
  print(pairwise.t.test(teach_agg$mathvar^power, teach_agg$classtypevar, p.adjust.method = 'bonf')$p.value)
  print(pairwise.t.test(teach_agg$mathvar^power, teach_agg$classtypevar, p.adjust.method = 'BH')$p.value)
  # One approach is to use the Tukey’s method to construction simultaneous confidence intervals for all pairwise comparisons.
  alpha=0.05;
  T.ci=TukeyHSD(fit,conf.level = 1-alpha)
  print(T.ci$`teach_agg$classtypevar`)
}

teach_agg = my_star('g1tmathss', 'g1classtype', 'g1schid', 'g1tchid')
##    g1tchid g1tmathss g1schid g1classtype classtypevar schoolvar  mathvar
## 1 11203804  486.8750  112038           3            3    112038 486.8750
## 2 11203805  499.8235  112038           1            1    112038 499.8235
## 3 11203806  496.0370  112038           2            2    112038 496.0370
## 4 12305604  523.7391  123056           2            2    123056 523.7391
## 5 12305605  532.8000  123056           3            3    123056 532.8000
## 6 12305606  534.4706  123056           1            1    123056 534.4706
fit = aov(teach_agg$mathvar ~ teach_agg$classtypevar + teach_agg$schoolvar);
boxcox(fit)

new_fit = aov(teach_agg$mathvar^-2 ~ teach_agg$classtypevar + teach_agg$schoolvar)
boxcox(new_fit)

my_star_stat(teach_agg, -2, new_fit)
##                         Df    Sum Sq   Mean Sq F value   Pr(>F)    
## teach_agg$classtypevar   2 1.967e-12 9.833e-13  20.588 4.99e-09 ***
## teach_agg$schoolvar     75 2.512e-11 3.350e-13   7.014  < 2e-16 ***
## Residuals              261 1.247e-11 4.780e-14                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##              1         2
## 2 0.0001606678        NA
## 3 0.0143422668 0.2251094
##              1         2
## 2 0.0001606678        NA
## 3 0.0215134003 0.6753281
##              1         2
## 2 0.0001606678        NA
## 3 0.0107567001 0.2251094
##              diff           lwr          upr        p adj
## 2-1  1.771954e-07  1.105047e-07 2.438860e-07 4.640207e-09
## 3-1  1.216169e-07  5.237978e-08 1.908539e-07 1.382845e-04
## 3-2 -5.557851e-08 -1.260147e-07 1.485769e-08 1.525267e-01

Grade 2 with grade 1 class type, school, and teacher id (class grouping)

teach_agg = my_star('g2tmathss', 'g1classtype', 'g1schid', 'g1tchid')
##    g1tchid g2tmathss g1schid g1classtype classtypevar schoolvar  mathvar
## 1 11203804  546.4286  112038           3            3    112038 546.4286
## 2 11203805  549.0000  112038           1            1    112038 549.0000
## 3 11203806  558.0000  112038           2            2    112038 558.0000
## 4 12305604  591.6667  123056           2            2    123056 591.6667
## 5 12305605  598.8000  123056           3            3    123056 598.8000
## 6 12305606  580.2500  123056           1            1    123056 580.2500
fit = aov(teach_agg$mathvar ~ teach_agg$classtypevar + teach_agg$schoolvar);
boxcox(fit)

new_fit = aov(teach_agg$mathvar^-1 ~ teach_agg$classtypevar + teach_agg$schoolvar)
boxcox(new_fit)

my_star_stat(teach_agg, -1, new_fit)
##                         Df    Sum Sq   Mean Sq F value   Pr(>F)    
## teach_agg$classtypevar   2 6.390e-08 3.195e-08  12.764 5.22e-06 ***
## teach_agg$schoolvar     74 1.205e-06 1.629e-08   6.507  < 2e-16 ***
## Residuals              254 6.357e-07 2.500e-09                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##            1         2
## 2 0.00259797        NA
## 3 0.13809623 0.1644885
##            1         2
## 2 0.00259797        NA
## 3 0.20714435 0.4934654
##            1         2
## 2 0.00259797        NA
## 3 0.10357218 0.1644885
##              diff           lwr          upr        p adj
## 2-1  3.301297e-05  1.755248e-05 4.847345e-05 2.718957e-06
## 3-1  1.860550e-05  2.547198e-06 3.466381e-05 1.844975e-02
## 3-2 -1.440746e-05 -3.068783e-05 1.872898e-06 9.468932e-02

Grade 3 with grade 1 class type, chool, and teacher id (class grouping)

teach_agg = my_star('g3tmathss', 'g1classtype', 'g1schid', 'g1tchid')
##    g1tchid g3tmathss g1schid g1classtype classtypevar schoolvar  mathvar
## 1 11203804  610.0000  112038           3            3    112038 610.0000
## 2 11203805  590.7500  112038           1            1    112038 590.7500
## 3 11203806  609.3333  112038           2            2    112038 609.3333
## 4 12305604  619.8462  123056           2            2    123056 619.8462
## 5 12305605  634.8421  123056           3            3    123056 634.8421
## 6 12305606  618.6154  123056           1            1    123056 618.6154
fit = aov(teach_agg$mathvar ~ teach_agg$classtypevar + teach_agg$schoolvar);
boxcox(fit)

new_fit = aov(log(teach_agg$mathvar) ~ teach_agg$classtypevar + teach_agg$schoolvar)
boxcox(new_fit)

# do this one custom
teach_agg$g3tmathss = log(teach_agg$g3tmathss)
my_star_stat(teach_agg, 1, new_fit)
##                         Df  Sum Sq  Mean Sq F value   Pr(>F)    
## teach_agg$classtypevar   2 0.01187 0.005933  10.355 4.74e-05 ***
## teach_agg$schoolvar     74 0.26348 0.003560   6.214  < 2e-16 ***
## Residuals              256 0.14667 0.000573                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##            1         2
## 2 0.01969451        NA
## 3 0.02123816 0.9504549
##            1  2
## 2 0.01969451 NA
## 3 0.03185724  1
##            1         2
## 2 0.01592862        NA
## 3 0.01592862 0.9504549
##              diff          lwr          upr        p adj
## 2-1 -0.0125013238 -0.019868577 -0.005134070 0.0002438196
## 3-1 -0.0122564119 -0.019910777 -0.004602046 0.0005818623
## 3-2  0.0002449119 -0.007544054  0.008033877 0.9969753167

Grade 2 with grade 2 class type, chool, and teacher id (class grouping)

teach_agg = my_star('g2tmathss', 'g2classtype', 'g2schid', 'g2tchid')
##    g2tchid g2tmathss g2schid g2classtype classtypevar schoolvar  mathvar
## 1 11203807  563.3500  112038           2            2    112038 563.3500
## 2 11203808  548.0000  112038           3            3    112038 548.0000
## 3 11203809  562.1333  112038           1            1    112038 562.1333
## 4 12305607  595.4500  123056           2            2    123056 595.4500
## 5 12305608  600.3810  123056           3            3    123056 600.3810
## 6 12305609  570.1538  123056           1            1    123056 570.1538
fit = aov(teach_agg$mathvar ~ teach_agg$classtypevar + teach_agg$schoolvar);
boxcox(fit)
new_fit = aov(teach_agg$mathvar^1 ~ teach_agg$classtypevar + teach_agg$schoolvar)
boxcox(new_fit)

my_star_stat(teach_agg, 1, new_fit)
##                         Df Sum Sq Mean Sq F value Pr(>F)    
## teach_agg$classtypevar   2   5990  2994.8   9.541  1e-04 ***
## teach_agg$schoolvar     73 138218  1893.4   6.032 <2e-16 ***
## Residuals              259  81299   313.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##            1         2
## 2 0.03040029        NA
## 3 0.03040029 0.9159568
##            1  2
## 2 0.03040029 NA
## 3 0.03699641  1
##           1         2
## 2 0.0184982        NA
## 3 0.0184982 0.9159568
##           diff        lwr       upr        p adj
## 2-1 -8.8555145 -14.417275 -3.293754 0.0006293511
## 3-1 -8.4750918 -13.945595 -3.004589 0.0009161656
## 3-2  0.3804226  -5.470231  6.231077 0.9871323318

Grade 3 with grade 3 class type, chool, and teacher id (class grouping)

teach_agg = my_star('g3tmathss', 'g3classtype', 'g3schid', 'g3tchid')
##    g3tchid g3tmathss g3schid g3classtype classtypevar schoolvar  mathvar
## 1 11203810  614.6400  112038           3            3    112038 614.6400
## 2 11203811  608.5417  112038           2            2    112038 608.5417
## 3 11203812  581.3077  112038           1            1    112038 581.3077
## 4 12305610  633.4375  123056           2            2    123056 633.4375
## 5 12305611  613.1000  123056           1            1    123056 613.1000
## 6 12305612  633.6364  123056           3            3    123056 633.6364
fit = aov(teach_agg$mathvar ~ teach_agg$classtypevar + teach_agg$schoolvar);
boxcox(fit)

new_fit = aov(log(teach_agg$mathvar) ~ teach_agg$classtypevar + teach_agg$schoolvar)
boxcox(new_fit)

# do this one custom
teach_agg$g3tmathss = log(teach_agg$g3tmathss)
my_star_stat(teach_agg, 1, new_fit)
##                         Df  Sum Sq  Mean Sq F value   Pr(>F)    
## teach_agg$classtypevar   2 0.01193 0.005966   9.555 9.94e-05 ***
## teach_agg$schoolvar     74 0.25823 0.003490   5.589  < 2e-16 ***
## Residuals              257 0.16045 0.000624                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##            1         2
## 2 0.05785374        NA
## 3 0.01377987 0.6432982
##            1  2
## 2 0.08678061 NA
## 3 0.01377987  1
##            1         2
## 2 0.04339030        NA
## 3 0.01377987 0.6432982
##             diff         lwr          upr        p adj
## 2-1 -0.010646642 -0.01868794 -0.002605348 0.0056764665
## 3-1 -0.013048919 -0.02061253 -0.005485307 0.0001867896
## 3-2 -0.002402276 -0.01090555  0.006100994 0.7834295503

Lets also test our model of interest with a training and testing dataset to furhter evaluate it

## 75% of the sample size
#smp_size <- floor(0.75 * nrow(teach_agg)))

## set the seed to make your partition reproducible
#set.seed(123)
#train_ind <- sample(seq_len(nrow(teach_agg)), size = smp_size)

#train <- teach_agg[train_ind, ]
#test <- teach_agg[-train_ind, ]

7. Conclude your analysis with an discussion of your findings and caveats of your approach.

TODO